Abstract

This note introduces three Bayesian-style multi-armed bandit algorithms: Information-Directed
Sampling, Thompson Sampling and Generalized Thompson Sampling. The goal is to give an
intuitive explanation of these three algorithms and their regret bounds, and to provide some
derivations that are omitted in the original papers.


1 Introduction 

A multi-armed bandit problem [1] is a sequential decision-making problem. At each time step the
learner selects an action based on its current knowledge and its arm-selection policy, and then
receives the reward of the selected action. Since the rewards of actions that are not selected are
unknown, the learner needs to balance exploiting its current knowledge to select the best arm
against exploring potentially better arms. In this note we describe three Bayesian-style
multi-armed bandit algorithms: Information-Directed Sampling [2], Thompson Sampling [3] and
Generalized Thompson Sampling [4]. Each of these three algorithms maintains a posterior
distribution indicating the probability of each arm/policy being optimal. However, they use
different rules to update this posterior distribution based on observed rewards.

2 Information-Directed Sampling 

2.1 Problem Formulation 

Information-Directed Sampling (IDS) [2] considers a Bayesian formulation of the multi-armed bandit
problem. In this setting there is a set of actions (arms) $\mathcal{A}$, and at time $t \in [1, T]$ the
decision-maker chooses an action $a_t \in \mathcal{A}$. Action $a_t$ then draws a reward $r_{a_t,t}$ from a
reward distribution $p_{a_t}$ $^1$. We assume that all rewards are i.i.d. and that the reward
distribution is stationary with respect to time $t \in [1, T]$.

To formulate the multi-armed bandit problem in a Bayesian way, we denote
$a^* = \arg\max_{a \in \mathcal{A}} \mathbb{E}_{r_a \sim p_a}[r_a]$, which means $a^*$ is the arm with the
highest expected reward with respect to the distributions $p_a$, where $a \in \mathcal{A}$. We also denote by
$r_{a^*}$ the reward drawn from $p_{a^*}$. The decision-maker does not know the real reward distribution
$p_a$, so it has its own estimate of these distributions at time step $t$, which we denote by $p_{a,t}$.
Because of this uncertainty, for each action $a$ at time $t$, the decision-maker has a belief about
whether this action has the highest expected reward. We denote this belief by
$\alpha_t(a) = P(a^* = a \mid \mathcal{F}_{t-1})$, where $\mathcal{F}_{t-1}$ is the history of past observations,
including the actions selected and the corresponding rewards. The decision-maker updates this
posterior distribution at each time step based on $\mathcal{F}_{t-1}$.

$^1$In the original paper they assume that the arms first draw an outcome from an outcome
distribution, and that there is a fixed and known function that maps outcomes to rewards. Here,
for the sake of simplicity, we assume the outcome is equal to the reward.

Instead of sampling actions directly from the posterior distribution $\alpha_t$, IDS samples actions
from a distribution $\pi_t$. $\pi_t$ is also a distribution over all actions and is constructed from the
posterior distribution $\alpha_t$. We are interested in the following expected regret:

$$ \mathbb{E}[\mathrm{Regret}(T)] = \mathbb{E}_{r_{a^*,t} \sim p_{a^*}} \left[ \sum_{t=1}^{T} r_{a^*,t} \right] - \mathbb{E}_{a \sim \pi_t,\; r_{a,t} \sim p_a} \left[ \sum_{t=1}^{T} r_{a,t} \right] \tag{1} $$


2.2 Algorithm 

In a multi-armed bandit problem, we want to balance exploitation and exploration. IDS
handles this trade-off by defining the immediate regret $\Delta_t(a)$ and the information gain $g_t(a)$
of action $a$ at time $t$.

2.2.1 Immediate Regret 

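The immediate regret $\Delta_t(a)$ is defined as

$$ \Delta_t(a) = \mathbb{E}_{a^* \sim \alpha_t,\; r_{a^*,t} \sim p_{a^*,t}} [r_{a^*,t} \mid \mathcal{F}_{t-1}] - \mathbb{E}_{r_{a,t} \sim p_{a,t}} [r_{a,t} \mid \mathcal{F}_{t-1}] \tag{2} $$

The idea behind this is that the regret is defined by formula (1); however, the decision-maker does
not know the true $p_{a^*}$ and $p_a$ for $a \in \mathcal{A}$, so it uses $p_{a^*,t}$ and $p_{a,t}$ instead to
estimate the regret at time step $t$. Note that

$$ P(r_{a^*,t} = r) = P(r_{a,t} = r \mid a^* = a) \tag{3} $$

So

$$ \mathbb{E}[r_{a^*,t} \mid \mathcal{F}_{t-1}] = \mathbb{E}[r_{a,t} \mid r_{b,t} \le r_{a,t}\ \forall b,\ \mathcal{F}_{t-1}] \tag{4} $$

We will show how to calculate each of these terms in Section 2.3.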


2.2.2 Information Gain 

Instead of doing pure exploitation using the immediate regret, one would also want to explore in
order to seek potentially better arms. To do this, IDS defines a quantity called the information
gain, denoted $g_t(a)$. The idea is as follows: we already have a posterior distribution over $a^*$,
and we hope that after we pull one of the arms, the entropy of this distribution decreases, so that
we gain a certain amount of information about which arm has the highest expected reward. Let
$a_t^* \sim \alpha_t$ and $a_{t+1}^* \sim \alpha_{t+1}$, and let $H(a_t^*)$ denote the entropy of $\alpha_t$; then
$g_t(a)$ is defined as

$$ g_t(a) = \mathbb{E}[H(a_t^*) - H(a_{t+1}^*) \mid \mathcal{F}_{t-1}, a_t = a] \tag{5} $$




The expectation is with respect to the random reward of arm $a$. To calculate this, one can sample
rewards from $p_a$ and then average the quantity above. However, the original paper uses the
following approach.

From the properties of mutual information we have:

$$ H(X) - H(X \mid Y) = I(X, Y) \tag{6} $$

and since $\mathbb{E}[H(a_{t+1}^*) \mid \mathcal{F}_{t-1}, a_t = a] = H(a_t^* \mid r_{a,t})$, we get

$$ g_t(a) = I(a_t^*, r_{a,t}) \tag{7} $$

Also from the properties of mutual information we have:

$$ I(X, Y) = \mathbb{E}_X \left[ D_{KL}( P(Y \mid X) \,\|\, P(Y) ) \right] \tag{8} $$

Since we do not have the true distribution of $r_{a,t}$, we use the posterior distribution $p_{a,t}$, and we
have:

$$ g_t(a) = \mathbb{E}_{a' \sim \alpha_t} \left[ D_{KL}( p_{a,t}(\cdot \mid a') \,\|\, p_{a,t} ) \right] \tag{9} $$

In the equation above, $p_{a,t}$ is just the reward posterior distribution of arm $a$ at time $t$, and
$p_{a,t}(\cdot \mid a')$ is the reward posterior distribution conditioned on $a'$ being the arm that has the
highest mean reward. Under this condition, the reward posterior distribution has to shift to satisfy
the constraint. For example, in Figure 1 we show 3 arms whose mean rewards follow Gaussian
distributions. Suppose we want to calculate the reward posterior distributions of arms 2 and 3
conditioned on arm 1 having the highest mean reward. We examine one point where the mean reward of
arm 1 is 0.8. Then the mean rewards of arm 2 and arm 3 cannot be greater than 0.8, so the
probability mass of these two arms above 0.8 has to be cut off, and the remainder has to be
renormalized.

2.2.3 Optimization 

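The goal of IDS at a single time step is to balance the immediate regret $\Delta_t(a)$ and the
information gain $g_t(a)$. There are many ways to do this, and the authors of the paper choose the
following one:

$$ \pi_t^{IDS} = \arg\min_{\pi \in \mathcal{D}(\mathcal{A})} \left\{ \Psi_t(\pi) := \frac{\Delta_t(\pi)^2}{g_t(\pi)} \right\} \tag{10} $$

where $\mathcal{D}(\mathcal{A})$ is the set of distributions over the arms. Writing $\Delta$ and $g$ for the vectors
of per-arm immediate regrets and information gains, and assuming $g$ has at least one non-zero
element, finding $\pi_t^{IDS}$ is equivalent to solving the following optimization problem:

$$ \text{minimize} \quad \Psi(\pi) := \frac{(\pi^T \Delta)^2}{\pi^T g} \tag{11} $$

$$ \text{subject to} \quad \pi^T e = 1 \tag{12} $$

$$ \pi \ge 0 \tag{13} $$

where $e$ is the all-ones vector. The authors state that the optimal $\pi$ can be very sparse, with at
most two non-zero elements, so they try all possible pairs of arms and keep the one that gives the
lowest $\Psi_t(\pi)$. Given $\pi$, IDS samples an arm from it and pulls that arm. I omit the details
here since they are well described in the IDS paper.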
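The pairwise search can be sketched as follows. This is only an illustrative sketch, not the procedure of the IDS paper: the function name `ids_distribution` and the grid search over the mixing probability are assumptions made here for simplicity (the paper solves for the mixing probability in closed form).

```python
from itertools import combinations

def ids_distribution(delta, gain, grid=1001):
    """Search two-arm supports for a minimiser of (pi^T delta)^2 / (pi^T gain).

    delta[i] and gain[i] are the immediate regret and information gain of arm i.
    The mixing probability q is searched on a uniform grid (an assumption of
    this sketch, chosen for simplicity).
    """
    n = len(delta)
    best_ratio, best_pi = float("inf"), None
    pairs = list(combinations(range(n), 2)) if n > 1 else [(0, 0)]
    for i, j in pairs:
        for step in range(grid):
            q = step / (grid - 1)              # probability mass placed on arm i
            d = q * delta[i] + (1 - q) * delta[j]
            g = q * gain[i] + (1 - q) * gain[j]
            if g <= 0:                         # no information gain: ratio undefined
                continue
            ratio = d * d / g
            if ratio < best_ratio:
                pi = [0.0] * n
                pi[i] += q
                pi[j] += 1 - q
                best_ratio, best_pi = ratio, pi
    return best_pi, best_ratio

# Arm 0 has low regret but is uninformative; arm 1 is costly but informative.
pi, ratio = ids_distribution([0.1, 0.3], [0.01, 0.5])
```

In this example neither pure arm is optimal: mixing the two arms gives a smaller information ratio than playing either one alone, which is why the sampling distribution is randomized.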
Figure 1: Example of $p_{a,t}(\cdot \mid a')$ with 3 arms



2.3 Bernoulli Bandit Experiment 

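In a $K$-armed Bernoulli bandit problem, there are $K$ arms, and the reward of the $i$-th arm follows
a Bernoulli distribution with mean $X_i$. In a Bayesian-style learning algorithm, it is standard to
model the mean reward of each arm using the Beta distribution:

$$ X_i \sim \mathrm{Beta}(\beta_i^1, \beta_i^2) \tag{14} $$

$$ r_i \sim \mathrm{Bernoulli}(X_i) \tag{15} $$

To calculate $\Delta_t(a)$ and $g_t(a)$, we first calculate $\alpha_t(a)$. Let
$f_i(x) = \mathrm{Beta.pdf}(x \mid \beta_i^1, \beta_i^2)$ and $F_i(x) = \mathrm{Beta.cdf}(x \mid \beta_i^1, \beta_i^2)$
for every arm $i$; that is, $f_i$ and $F_i$ are the PDF and CDF of the posterior distribution of $X_i$.
Then $\alpha_t$ is calculated as

$$ \alpha_t(i) = P(X_k \le X_i\ \forall k) \tag{16} $$

$$ = \int_0^1 P(X_i = x)\, P(X_k \le x\ \forall k \ne i)\, dx \tag{17} $$

$$ = \int_0^1 f_i(x) \prod_{k \ne i} F_k(x)\, dx \tag{18} $$

$$ = \int_0^1 \frac{f_i(x)}{F_i(x)} F(x)\, dx \tag{19} $$

where $F(x) = \prod_{i=1}^K F_i(x)$. To calculate this integral, we need to evaluate $f_i$, $F_i$ and
$F_j$ at sampled points and then sum, so it is quite time consuming.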
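A minimal numerical sketch of this computation is given below. It uses a midpoint rule on a uniform grid and evaluates the product of the other arms' CDFs directly; the function names, the grid size and the midpoint rule are assumptions of this sketch rather than anything prescribed by the IDS paper, and Beta parameters of at least 1 are assumed.

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def optimal_arm_probabilities(params, n_grid=2000):
    """Approximate alpha(i) = int f_i(x) prod_{k != i} F_k(x) dx for each arm i.

    params is a list of (beta1, beta2) posterior parameters, one pair per arm.
    """
    h = 1.0 / n_grid
    xs = [(m + 0.5) * h for m in range(n_grid)]          # cell midpoints
    pdfs = [[beta_pdf(x, a, b) for x in xs] for a, b in params]
    # CDF at each midpoint: mass of all earlier cells plus half of the current cell.
    cdfs = []
    for pdf in pdfs:
        acc, cdf = 0.0, []
        for p in pdf:
            cdf.append(acc + 0.5 * p * h)
            acc += p * h
        cdfs.append(cdf)
    alphas = []
    for i in range(len(params)):
        total = 0.0
        for m in range(n_grid):
            others = 1.0
            for k in range(len(params)):
                if k != i:
                    others *= cdfs[k][m]
            total += pdfs[i][m] * others * h
        alphas.append(total)
    return alphas

# Two arms with identical uniform posteriors are equally likely to be optimal.
alphas = optimal_arm_probabilities([(1, 1), (1, 1)])
```

The cost is proportional to the number of arms squared times the grid size per time step, which illustrates why the exact computation becomes expensive.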

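Next we need to calculate $\mathbb{E}[X_j \mid a^* = i]$, which is the same as calculating
$M_{i,j} := \mathbb{E}[X_j \mid X_k \le X_i\ \forall k]$:

$$ M_{i,j} = \mathbb{E}[X_j \mid X_k \le X_i\ \forall k] \tag{20} $$

$$ = \int_0^1 x\, P(X_j = x \mid X_k \le X_i\ \forall k)\, dx \tag{21} $$

$$ = \int_0^1 x\, \frac{P(X_j = x,\ X_k \le X_i\ \forall k)}{P(X_k \le X_i\ \forall k)}\, dx \tag{22} $$

Suppose $i \ne j$, then

$$ M_{i,j} = \frac{1}{\alpha_t(i)} \int_0^1 x\, P(X_k \le X_i\ \forall k \ne j,\ X_j = x,\ X_i \ge x)\, dx \tag{23} $$

$$ = \frac{1}{\alpha_t(i)} \int_0^1 x\, P(X_j = x)\, P(X_k \le X_i\ \forall k \ne i \text{ or } j,\ X_i \ge x)\, dx \tag{24} $$

$$ = \frac{1}{\alpha_t(i)} \int_0^1 x\, P(X_j = x) \int_x^1 P(X_k \le y\ \forall k \ne i \text{ or } j)\, P(X_i = y)\, dy\, dx \tag{25} $$

$$ = \frac{1}{\alpha_t(i)} \int_0^1 x f_j(x) \int_x^1 \left( \frac{f_i(y) F(y)}{F_i(y) F_j(y)} \right) dy\, dx \tag{26} $$

$$ = \frac{1}{\alpha_t(i)} \int_0^1 \left( \frac{f_i(y) F(y)}{F_i(y) F_j(y)} \right) \int_0^y x f_j(x)\, dx\, dy \tag{27} $$

$$ = \frac{1}{\alpha_t(i)} \int_0^1 \left( \frac{f_i(y) F(y)}{F_i(y) F_j(y)} \right) Q_j(y)\, dy \tag{28} $$

where $Q_j(y) = \int_0^y x f_j(x)\, dx$. To calculate $Q_j(y)$ we also need to evaluate the integrand at
sampled points and then sum.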
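Suppose $i = j$, then following the same steps

$$ M_{i,i} = \frac{1}{\alpha_t(i)} \int_0^1 x\, P(X_i = x,\ X_k \le x\ \forall k \ne i)\, dx \tag{29} $$

$$ = \frac{1}{\alpha_t(i)} \int_0^1 x\, P(X_i = x)\, P(X_k \le x\ \forall k \ne i)\, dx \tag{30} $$

$$ = \frac{1}{\alpha_t(i)} \int_0^1 x\, \frac{f_i(x)}{F_i(x)} F(x)\, dx \tag{31} $$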


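Now that we have $\alpha_t(a)$ and $M_{i,j} = \mathbb{E}[X_j \mid X_k \le X_i\ \forall k]$, we can calculate
$\Delta_t(a)$ and $g_t(a)$. Writing $\rho^*$ for the expected reward of the optimal arm, definitions (2)
and (9) become

$$ \rho^* = \sum_{i=1}^K \alpha_t(i)\, M_{i,i} \tag{32} $$

$$ \Delta_t(a) = \rho^* - \frac{\beta_a^1}{\beta_a^1 + \beta_a^2} \tag{33} $$

$$ g_t(a) = \sum_{i=1}^K \alpha_t(i)\, KL\left( M_{i,a} \,\Big\|\, \frac{\beta_a^1}{\beta_a^1 + \beta_a^2} \right) \tag{34} $$

where $KL(p_1 \| p_2)$ is defined as
$KL(p_1 \| p_2) = p_1 \log\left(\frac{p_1}{p_2}\right) + (1 - p_1) \log\left(\frac{1 - p_1}{1 - p_2}\right)$, since $p_{a,t}$
follows a Bernoulli distribution.

At each time step, we can calculate $\Delta_t(a)$ and $g_t(a)$ by the above procedure, then solve the
optimization problem to get $\pi$, and sample an arm from $\pi$.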
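As a hedged illustration of this last step, the sketch below computes the immediate regret and information gain of each arm from the optimality probabilities, the conditional means and the posterior means, following definitions (2) and (9) specialised to Bernoulli rewards. The function names and argument layout are hypothetical, and the clipping constant `eps` is an assumption added to keep the logarithms finite.

```python
import math

def kl_bernoulli(p1, p2, eps=1e-12):
    """KL divergence between Bernoulli(p1) and Bernoulli(p2)."""
    p1 = min(max(p1, eps), 1 - eps)
    p2 = min(max(p2, eps), 1 - eps)
    return p1 * math.log(p1 / p2) + (1 - p1) * math.log((1 - p1) / (1 - p2))

def regret_and_gain(alpha, M, means):
    """alpha[i] = P(a* = i), M[i][j] = E[X_j | a* = i], means[j] = E[X_j]."""
    K = len(alpha)
    # Expected reward of the (unknown) optimal arm.
    rho_star = sum(alpha[i] * M[i][i] for i in range(K))
    delta = [rho_star - means[j] for j in range(K)]
    gain = [
        sum(alpha[i] * kl_bernoulli(M[i][j], means[j]) for i in range(K))
        for j in range(K)
    ]
    return delta, gain

# Two arms with uniform posteriors: E[max] = 2/3 and E[min] = 1/3.
alpha = [0.5, 0.5]
M = [[2 / 3, 1 / 3], [1 / 3, 2 / 3]]
delta, gain = regret_and_gain(alpha, M, [0.5, 0.5])
```

In this symmetric example both arms have the same regret (1/6) and the same information gain, so any sampling distribution attains the same information ratio.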

2.4 Regret Bound 

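Here we prove a general regret bound; for more specific regret bounds, we refer to the IDS paper.
For a fixed deterministic $\lambda \in \mathbb{R}$ and a policy $\pi = (\pi_1, \pi_2, \ldots)$ such that
$\Psi_t(\pi_t) \le \lambda$, we have

$$ \mathbb{E}[\mathrm{Regret}(T, \pi)] \le \sqrt{\lambda H(\alpha_1) T} \tag{35} $$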


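Proof:

$$ \mathbb{E} \sum_{t=1}^{T} g_t(\pi_t) = \mathbb{E} \sum_{t=1}^{T} \mathbb{E}[H(\alpha_t) - H(\alpha_{t+1}) \mid \mathcal{F}_{t-1}] \tag{36} $$

$$ = \mathbb{E} \sum_{t=1}^{T} \left( H(\alpha_t) - H(\alpha_{t+1}) \right) \tag{37} $$

$$ = H(\alpha_1) - \mathbb{E} H(\alpha_{T+1}) \tag{38} $$

$$ \le H(\alpha_1) \tag{39} $$

By definition, $\Psi_t(\pi_t) \le \lambda$, so $\Delta_t(\pi_t) \le \sqrt{\lambda} \sqrt{g_t(\pi_t)}$, and therefore

$$ \mathbb{E}[\mathrm{Regret}(T, \pi)] = \mathbb{E} \sum_{t=1}^{T} \Delta_t(\pi_t) \tag{40} $$

$$ \le \sqrt{\lambda}\; \mathbb{E} \sum_{t=1}^{T} \sqrt{g_t(\pi_t)} \tag{41} $$

$$ \le \sqrt{\lambda T} \sqrt{ \mathbb{E} \sum_{t=1}^{T} g_t(\pi_t) } \quad \text{(Cauchy-Schwarz inequality)} \tag{42} $$

$$ \le \sqrt{\lambda H(\alpha_1) T} \tag{43} $$

In the paper, the authors prove that $\Psi_t(\pi_t^{IDS}) \le |\mathcal{A}|/2$, so
$\mathbb{E}[\mathrm{Regret}(T, \pi^{IDS})] \le \sqrt{\tfrac{1}{2} |\mathcal{A}| H(\alpha_1) T}$.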
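To get a feel for the size of the bound $\sqrt{\tfrac{1}{2} |\mathcal{A}| H(\alpha_1) T}$, the following sketch evaluates it numerically. The function name `ids_regret_bound` and the default of a uniform prior over the optimal arm are assumptions of this illustration; the entropy is measured in nats.

```python
import math

def ids_regret_bound(num_arms, horizon, prior=None):
    """Evaluate sqrt(0.5 * |A| * H(alpha_1) * T) for a prior over the optimal arm."""
    if prior is None:
        prior = [1.0 / num_arms] * num_arms      # uniform prior (assumption)
    entropy = -sum(p * math.log(p) for p in prior if p > 0)
    return math.sqrt(0.5 * num_arms * entropy * horizon)

# 10 arms, uniform prior, 1000 rounds: the bound is roughly 107.3.
bound = ids_regret_bound(10, 1000)
```

A more concentrated prior has lower entropy and therefore gives a smaller bound, which reflects the fact that less remains to be learned about the identity of the optimal arm.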


2.5 Potential Problems 

IDS shows strong empirical results; however, there are several potential problems. I think the
main problem is that the algorithm was very time consuming when I ran it: it has three integrals
to calculate, so we have to evaluate each integrand on a discrete grid of points. Another
problem is that the paper does not explain why this particular form of $\Psi_t$ was chosen as the
trade-off between $\Delta_t$ and $g_t$, since there are many ways to make this trade-off. It would also
be nice to see a generalization to contextual bandits.


3 Thompson Sampling 

3.1 Problem Formulation 

Thompson sampling (TS) [3, 5] is also a Bayesian-style bandit algorithm; it applies to both
contextual bandit and standard multi-armed bandit problems. Here we discuss the non-contextual
version. Again, we assume there is an action set $\mathcal{A}$, and at time step $t$ Thompson sampling
selects an action $a$ and receives reward $r_{a,t}$. We also assume the reward of each arm follows some
parametric distribution $p_a = P(r \mid a, \theta_a)$ with mean $\mu_a$, where $\theta_a$ is the parameter.
Let $\mathcal{D}$ denote the past observations, consisting of the arms pulled and the rewards observed.
At the beginning, Thompson sampling assumes a prior distribution on the parameters $\theta_a$, and
after each time step it updates the posterior distribution $P(\theta_a \mid \mathcal{D})$ based on past
observations. Similar to IDS, the goal is to minimize the regret:

$$ \mathbb{E}[\mathrm{Regret}(T)] = \mathbb{E}_{r_{a^*,t} \sim p_{a^*}} \left[ \sum_{t=1}^{T} r_{a^*,t} \right] - \mathbb{E}_{r_{a,t} \sim p_a} \left[ \sum_{t=1}^{T} r_{a,t} \right] \tag{44} $$

where $a^*$ is the arm with the highest expected reward, and $a$ is the arm selected by Thompson
sampling.
3.2 Algorithm

Similar to IDS, Thompson sampling randomly selects an action $a$ according to its probability of
being optimal. That is, action $a$ is chosen with probability

$$ \int \mathbb{I}\left[ \mathbb{E}(r \mid a, \theta) = \max_{a'} \mathbb{E}(r \mid a', \theta) \right] P(\theta \mid \mathcal{D})\, d\theta \tag{45} $$

which is essentially the same as $\alpha_t$ in IDS. However, calculating $\alpha_t$ is time consuming, and
Thompson sampling does not need $\alpha_t$ explicitly; it only needs samples from $\alpha_t$. It therefore
suffices to draw a random parameter $\theta$ from the posterior distribution and play the arm that is
optimal under it. Algorithm 1 describes the procedure of Thompson sampling for the Bernoulli bandit
problem.


Algorithm 1 Thompson sampling with Bernoulli multi-armed bandit
Require: $\alpha$, $\beta$: prior parameters of a Beta distribution
For each arm $i = 1, \ldots, K$ set $S_i = 0$, $F_i = 0$
for $t = 1, \ldots, T$ do
    for arm $i = 1, \ldots, K$ do
        Draw $\theta_i$ from $\mathrm{Beta}(\alpha + S_i, \beta + F_i)$
    end for
    Play arm $a = \arg\max_i \theta_i$, and observe reward $r_t$
    if $r_t = 1$ then
        $S_a = S_a + 1$
    else
        $F_a = F_a + 1$
    end if
end for
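Algorithm 1 translates almost line by line into code. The following is a minimal simulation sketch: the function name, the simulated Bernoulli environment driven by `true_means`, and the fixed random seed are assumptions made here so that the example is self-contained.

```python
import random

def thompson_sampling_bernoulli(true_means, horizon, alpha=1.0, beta=1.0, seed=0):
    """Run Thompson sampling on a simulated Bernoulli bandit.

    true_means are the (hidden) success probabilities used only to simulate
    rewards; alpha and beta are the Beta prior parameters of Algorithm 1.
    """
    rng = random.Random(seed)
    K = len(true_means)
    successes = [0] * K          # S_i in Algorithm 1
    failures = [0] * K           # F_i in Algorithm 1
    pulls = [0] * K
    total_reward = 0
    for _ in range(horizon):
        # Draw one sample per arm from its Beta posterior.
        samples = [
            rng.betavariate(alpha + successes[i], beta + failures[i])
            for i in range(K)
        ]
        arm = max(range(K), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        if reward == 1:
            successes[arm] += 1
        else:
            failures[arm] += 1
        pulls[arm] += 1
        total_reward += reward
    return pulls, total_reward

pulls, total_reward = thompson_sampling_bernoulli([0.1, 0.9], horizon=2000)
```

With well-separated arms such as these, the posterior of the inferior arm quickly concentrates below that of the better arm, so almost all pulls go to the better arm.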


3.3 Regret 

Although Thompson sampling is a very old algorithm, proposed in [6], its theoretical analysis was
carried out only recently. We follow [5] and hope to give an intuitive explanation of the regret.
Let $\mu^* = \max_i \mu_i$ and $\Delta_i = \mu^* - \mu_i$, where $i \in \mathcal{A}$, and let $k_i(t)$ denote the
number of times arm $i$ has been played up to step $t - 1$. Then the expected total regret up to time
$T$ can be written as

$$ \mathbb{E}[\mathrm{Regret}(T)] = \sum_i \Delta_i\, \mathbb{E}[k_i(T + 1)] \tag{46} $$

Hence, to bound the expected regret, we need to bound $\mathbb{E}[k_i(T + 1)]$ for all $i \in \mathcal{A}$.
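As a small worked instance of the decomposition in (46), the sketch below computes the regret from the arm means and the expected number of pulls of each arm. The function name and the numbers are purely illustrative assumptions.

```python
def expected_regret(means, expected_pulls):
    """Sum over arms of (gap to the best mean) times (expected number of pulls)."""
    best = max(means)
    return sum((best - m) * k for m, k in zip(means, expected_pulls))

# Gaps are 0, 0.4 and 0.1; regret = 0.4 * 20 + 0.1 * 80 = 16.
regret = expected_regret([0.9, 0.5, 0.8], [900, 20, 80])
```

Pulls of the optimal arm contribute nothing, so controlling the regret amounts to controlling how often each suboptimal arm is played.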

To bound $\mathbb{E}[k_i(T + 1)]$ we need the following notation [5]. Define $F^B_{n,p}(\cdot)$ as the cdf
and $f^B_{n,p}(\cdot)$ as the pdf of the binomial distribution with parameters $n, p$. Define
$F^{beta}_{\alpha,\beta}(\cdot)$ as the cdf of the beta distribution with parameters $\alpha, \beta$. Let $i(t)$
denote the arm played at time $t$, $k_i(t)$ the number of plays of arm $i$ until time $t - 1$, $S_i(t)$
the number of successes among the plays of arm $i$ until time $t - 1$ for the Bernoulli bandit case,
$\hat{\mu}_i(t)$ the empirical mean, and $\theta_i(t)$ the sampled mean reward of arm $i$ at time $t$. We
assume the first arm is the unique optimal arm, i.e., $\mu^* = \mu_1$. For each arm $i$, we choose two
thresholds $x_i$ and $y_i$ such that $\mu_i < x_i < y_i < \mu_1$. With different choices of $x_i$ and
$y_i$, we obtain problem-dependent and problem-independent bounds respectively. We also define
$E_i^{\mu}(t)$ as the event that $\hat{\mu}_i(t) \le x_i$ and $E_i^{\theta}(t)$ as the event that
$\theta_i(t) \le y_i$. Finally, define $\mathcal{F}_{t-1} = \{ i(w), r_{i(w)}(w),\ w = 1, \ldots, t - 1 \}$ and
$p_{i,t} = P(\theta_1(t) > y_i \mid \mathcal{F}_{t-1})$. $p_{i,t}$ is the probability that the sampled reward of
arm 1 is greater than $y_i$ at time $t$.

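We can decompose $\mathbb{E}[k_i(T + 1)]$ into the following terms, where an overline denotes the
complement of an event:

$$ \mathbb{E}[k_i(T + 1)] = \sum_{t=1}^{T} P(i(t) = i) \tag{47} $$

$$ = \sum_{t=1}^{T} P(i(t) = i,\ E_i^{\mu}(t),\ E_i^{\theta}(t)) \tag{48} $$

$$ + \sum_{t=1}^{T} P(i(t) = i,\ E_i^{\mu}(t),\ \overline{E_i^{\theta}(t)}) \tag{49} $$

$$ + \sum_{t=1}^{T} P(i(t) = i,\ \overline{E_i^{\mu}(t)}) \tag{50} $$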

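So we need to bound (48), (49) and (50) respectively. To bound (48), [5] proved that

$$ P(i(t) = i,\ E_i^{\mu}(t),\ E_i^{\theta}(t) \mid \mathcal{F}_{t-1}) \le \frac{1 - p_{i,t}}{p_{i,t}}\, P(i(t) = 1,\ E_i^{\mu}(t),\ E_i^{\theta}(t) \mid \mathcal{F}_{t-1}) $$

and so

$$ \sum_{t=1}^{T} P(i(t) = i,\ E_i^{\mu}(t),\ E_i^{\theta}(t)) = \sum_{t=1}^{T} \mathbb{E}\left[ P(i(t) = i,\ E_i^{\mu}(t),\ E_i^{\theta}(t) \mid \mathcal{F}_{t-1}) \right] \tag{51} $$

$$ \le \sum_{t=1}^{T} \mathbb{E}\left[ \frac{1 - p_{i,t}}{p_{i,t}}\, P(i(t) = 1,\ E_i^{\mu}(t),\ E_i^{\theta}(t) \mid \mathcal{F}_{t-1}) \right] \tag{52} $$

$$ = \sum_{t=1}^{T} \mathbb{E}\left[ \frac{1 - p_{i,t}}{p_{i,t}}\, I(i(t) = 1,\ E_i^{\mu}(t),\ E_i^{\theta}(t)) \right] \tag{53} $$

$$ \le \sum_{k=0}^{T-1} \mathbb{E}\left[ \frac{1}{p_{i,\tau_k+1}} - 1 \right] \tag{54} $$

where $\tau_k$ denotes the time step at which arm 1 is played for the $k$-th time. (54) only involves
$p_{i,\tau_k+1}$ because the posterior distribution of the parameters of arm 1 only changes when arm 1
gets pulled. Now we need to bound (54). Let $k_1(t) = j$ and $S_1(t) = s$. From the fact that
$F^{beta}_{\alpha,\beta}(y) = 1 - F^B_{\alpha+\beta-1,y}(\alpha - 1)$ we have
$p_{i,t} = P(\theta_1(t) > y_i) = F^B_{j+1,y_i}(s)$, and since, under a uniform $\mathrm{Beta}(1, 1)$ prior,

$$ S_1(t) \sim \mathrm{Binomial}(k_1(t), \mu_1) \tag{55} $$

$$ \theta_1(t) \sim \mathrm{Beta}(S_1(t) + 1,\ k_1(t) - S_1(t) + 1) \tag{56} $$

each possible value $S_1(t) = s$ corresponds to a value $p_{i,\tau_j+1} = F^B_{j+1,y_i}(s)$ with
probability $f^B_{j,\mu_1}(s)$, so

$$ \mathbb{E}\left[ \frac{1}{p_{i,\tau_j+1}} \right] = \sum_{s=0}^{j} \frac{f^B_{j,\mu_1}(s)}{F^B_{j+1,y_i}(s)} \tag{57} $$

So we have reduced the problem of bounding (54) to the problem of bounding a sum of random
variables involving the binomial distribution. [5] provides the details of how to bound (57),
which are quite complicated.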
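The beta-binomial identity used above, and the resulting sum over binomial outcomes, can be checked numerically. The sketch below is an illustration under stated assumptions: the beta cdf is approximated by a midpoint rule (so it is only accurate up to discretisation error), integer beta parameters are used in the check, and all function names are hypothetical.

```python
import math

def binomial_cdf(s, n, p):
    """P(Binomial(n, p) <= s)."""
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(s + 1))

def beta_cdf(y, a, b, n_grid=20000):
    """Midpoint-rule approximation of the Beta(a, b) cdf at y in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    h = y / n_grid
    total = 0.0
    for m in range(n_grid):
        x = (m + 0.5) * h
        log_pdf = log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x)
        total += math.exp(log_pdf) * h
    return total

def expected_inverse_p(j, mu1, y):
    """Sum over s of f^B_{j,mu1}(s) / F^B_{j+1,y}(s), i.e. E[1 / p] after j plays."""
    return sum(
        math.comb(j, s) * mu1**s * (1 - mu1) ** (j - s) / binomial_cdf(s, j + 1, y)
        for s in range(j + 1)
    )

# Identity check: F^beta_{3,4}(0.4) should match 1 - F^B_{6,0.4}(2).
lhs = beta_cdf(0.4, 3, 4)
rhs = 1 - binomial_cdf(2, 6, 0.4)
```

For $j = 0$ the sum reduces to $1 / (1 - y)$, which is a convenient sanity check; for larger $j$ the terms with small $s$ are the ones that make the expectation hard to bound.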
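Now we bound (50). Let $\tau_k$ denote the time at which the $k$-th trial of arm $i$ happens, and let
$\tau_0 = 0$. We have

$$ \sum_{t=1}^{T} P(i(t) = i,\ \overline{E_i^{\mu}(t)}) \le \mathbb{E}\left[ \sum_{k=0}^{T-1} \sum_{t=\tau_k+1}^{\tau_{k+1}} I(i(t) = i)\, I(\overline{E_i^{\mu}(t)}) \right] \tag{58} $$

Since $E_i^{\mu}(t)$ does not change unless arm $i$ is pulled, and
$\sum_{t=\tau_k+1}^{\tau_{k+1}} I(i(t) = i) = 1$, the right-hand side of (58) is equal to

$$ \mathbb{E}\left[ \sum_{k=0}^{T-1} I(\overline{E_i^{\mu}(\tau_k + 1)}) \sum_{t=\tau_k+1}^{\tau_{k+1}} I(i(t) = i) \right] \tag{59} $$

$$ = \mathbb{E}\left[ \sum_{k=0}^{T-1} I(\overline{E_i^{\mu}(\tau_k + 1)}) \right] \tag{60} $$

$$ \le 1 + \mathbb{E}\left[ \sum_{k=1}^{T-1} I(\overline{E_i^{\mu}(\tau_k + 1)}) \right] \tag{61} $$

$$ \le 1 + \sum_{k=1}^{T-1} \exp(-k\, d(x_i, \mu_i)) \tag{62} $$

$$ \le 1 + \frac{1}{d(x_i, \mu_i)} \tag{63} $$

where the second-to-last inequality follows from the Chernoff bound and
$d(x, y) = x \ln\frac{x}{y} + (1 - x) \ln\frac{1 - x}{1 - y}$.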
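The last inequality replaces a geometric sum by $1/d$. A quick numerical sketch, with hypothetical function names and illustrative values of the threshold and mean, shows how close the two quantities are:

```python
import math

def bernoulli_kl(x, y):
    """d(x, y) = x ln(x / y) + (1 - x) ln((1 - x) / (1 - y)) for x, y in (0, 1)."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def chernoff_sum(x, mu, horizon):
    """Sum over k = 1, ..., horizon - 1 of exp(-k * d(x, mu))."""
    d = bernoulli_kl(x, mu)
    return sum(math.exp(-k * d) for k in range(1, horizon))

# Threshold 0.6 for an arm with mean 0.5, over 1000 rounds.
total = chernoff_sum(0.6, 0.5, 1000)
upper = 1 / bernoulli_kl(0.6, 0.5)
```

The sum stays below $1/d$ for any horizon because $\sum_{k \ge 1} e^{-kd} = 1/(e^{d} - 1) \le 1/d$. When the threshold is close to the mean, $d$ is small and the bound becomes large, which is the source of the problem-dependent terms in the regret.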
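Similarly, [5] bounds

$$ \sum_{t=1}^{T} P(i(t) = i,\ E_i^{\mu}(t),\ \overline{E_i^{\theta}(t)}) \le L_i(T) + 1 \tag{64} $$

where $L_i(T) = \frac{\ln T}{d(x_i, y_i)}$. Together with these three bounds and a choice of $x_i$ and $y_i$
for all $i \in \mathcal{A}$, we can get a problem-independent bound $O(\sqrt{N T \ln N})$, where $N$ denotes
the number of arms.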

4 Generalized Thompson Sampling 

4.1 Problem Formulation 

Generalized Thompson Sampling [4] addresses the contextual bandit problem; it is similar to the
expert-learning framework, and includes Thompson Sampling as a special case. Let $\mathcal{X}$ and
$\mathcal{A}$ be the sets of contexts and arms, and let $K = |\mathcal{A}|$. At time step $t \in [1, T]$ the
decision-maker observes the context $x_t \in \mathcal{X}$ and selects an arm $a_t \in \mathcal{A}$. Then it
receives reward $r_t \in \{0, 1\}$, with expectation $\mu(x_t, a_t)$. In [3] the reward is binary, but it
is easy to generalize to a continuous space. Unlike the classic Thompson Sampling algorithm,
Generalized Thompson Sampling allows the decision-maker to have access to a set of experts
$\mathcal{E} = \{\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_N\}$, each of which makes predictions about the
average reward $\mu(x_t, a_t)$. Let $f_i$ be the prediction function associated with expert
$\mathcal{E}_i$; its arm-selection policy is $\mathcal{E}_i(x) = \arg\max_{a \in \mathcal{A}} f_i(x, a)$. Each expert
could be a generalized linear model or another prediction model. The regret is defined as

$$ \mathbb{E}[\mathrm{Regret}(T)] = \max_{\mathcal{E}_i \in \mathcal{E}} \sum_{t=1}^{T} \mu(x_t, \mathcal{E}_i(x_t)) - \mathbb{E}\left[ \sum_{t=1}^{T} \mu(x_t, a_t) \right] \tag{65} $$

That is, we are competing with the best expert.

4.2 Algorithm 

Generalized Thompson Sampling is described in Algorithm 2. We can see that it updates the weights
by $w_{i,t+1} \propto w_{i,t} \exp(-\eta\, \ell(f_i(x_t, a_t), r_t))$, where $\ell$ is the loss function. The
term 'Generalized' in 'Generalized Thompson Sampling' means that we can use different types of loss
functions when updating $w_i$. [3] described two loss functions: logarithmic loss and square loss.
Logarithmic loss is defined as
$\ell(\hat{r}, r) = \mathbb{I}(r = 1) \ln(1/\hat{r}) + \mathbb{I}(r = 0) \ln(1/(1 - \hat{r}))$, and square loss is
defined as $\ell(\hat{r}, r) = (\hat{r} - r)^2$. In the next section, we will show that if the loss function
is the logarithmic loss, then Generalized Thompson Sampling takes the form of Thompson Sampling.

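Algorithm 2 Generalized Thompson Sampling
Require: $\eta > 0$, $\gamma > 0$, experts $\mathcal{E}_1, \ldots, \mathcal{E}_N$, prior $p$
For each expert $i = 1, \ldots, N$ set $w_{i,1} = p_i$; set $W_1 = \|w_1\|_1$
for $t = 1, \ldots, T$ do
    Receive context $x_t \in \mathcal{X}$
    for arm $a = 1, \ldots, K$ do
        $P(a) = (1 - \gamma) \sum_{i=1}^{N} \frac{w_{i,t}}{W_t}\, \mathbb{I}(\mathcal{E}_i(x_t) = a) + \frac{\gamma}{K}$
    end for
    Select arm $a_t$ based on $P(a)$, observe reward $r_t$, update weights:
    $\forall i: w_{i,t+1} = w_{i,t} \exp(-\eta\, \ell(f_i(x_t, a_t), r_t))$; $W_{t+1} = \|w_{t+1}\|_1$
end for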
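A simulation sketch of the procedure is given below. Several details are assumptions of this sketch rather than statements about [4]: the arm probabilities mix the weighted expert recommendations with uniform exploration $\gamma / K$, the prior over experts is uniform, the weights are renormalised after every step to avoid numerical underflow, and the environment is supplied as a hypothetical callback `reward_fn(x, arm, rng)`.

```python
import math
import random

def log_loss(pred, reward, eps=1e-12):
    """Logarithmic loss of predicting P(r = 1) = pred when the reward is 0 or 1."""
    pred = min(max(pred, eps), 1 - eps)
    return math.log(1 / pred) if reward == 1 else math.log(1 / (1 - pred))

def square_loss(pred, reward):
    return (pred - reward) ** 2

def generalized_thompson_sampling(experts, contexts, reward_fn, num_arms,
                                  eta=1.0, gamma=0.1, loss=log_loss, seed=0):
    """experts[i](x, a) returns expert i's predicted mean reward of arm a in context x."""
    rng = random.Random(seed)
    weights = [1.0 / len(experts)] * len(experts)    # uniform prior (assumption)
    arms_played = []
    for x in contexts:
        total = sum(weights)
        # Mixture of expert recommendations plus uniform exploration (assumption).
        probs = [gamma / num_arms] * num_arms
        for w, f in zip(weights, experts):
            greedy = max(range(num_arms), key=lambda a: f(x, a))
            probs[greedy] += (1 - gamma) * w / total
        arm = rng.choices(range(num_arms), weights=probs)[0]
        reward = reward_fn(x, arm, rng)
        # Exponential-weights update with the chosen loss.
        weights = [w * math.exp(-eta * loss(f(x, arm), reward))
                   for w, f in zip(weights, experts)]
        norm = sum(weights)
        weights = [w / norm for w in weights]        # renormalise to avoid underflow
        arms_played.append(arm)
    return weights, arms_played

# Two context-free experts: one predicts the true means, the other swaps them.
good = lambda x, a: [0.2, 0.8][a]
bad = lambda x, a: [0.8, 0.2][a]
env = lambda x, a, rng: 1 if rng.random() < [0.2, 0.8][a] else 0
weights, arms = generalized_thompson_sampling([good, bad], [None] * 500, env, num_arms=2)
```

In this toy run the weight of the well-calibrated expert converges to one, after which the algorithm follows its recommendation except for the $\gamma$ fraction of uniform exploration.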


4.3 Connection with Expert-Learning and Thompson Sampling 

Generalized Thompson Sampling has the form of exponential weighting over experts; however, it also
fits the Thompson Sampling framework. There are two ways to see this, and both require the loss to
be the log loss: if an expert $f_i$ predicts that the probability of $r = 1$ is $p_i$ and the
probability of $r = 0$ is $1 - p_i$, then the log loss of expert $f_i$ is $\ln\frac{1}{p_i}$ when the reward
is 1, and $\ln\frac{1}{1 - p_i}$ when the reward is 0.

The first way to see this: we can think of Generalized Thompson Sampling as maintaining a
posterior distribution over the experts, given by the weights $w_t$. This posterior distribution
may be interpreted as the posterior probability that $f_i$ is the reward-maximizing expert. The
update rule, for one step (taking $\eta = 1$), is

$$ w_{i,t+1} \propto w_{i,t} \exp(-\ell(f_i(x_t, a_t), r_t)) \tag{66} $$

$$ \propto w_{i,t} \exp\left( -\ln\frac{1}{p_i(r_t \mid x_t, a_t)} \right) \tag{67} $$

$$ \propto w_{i,t}\, p_i(r_t \mid x_t, a_t) \tag{68} $$

Let $f^* = f_i$ be the event that $f_i$ is the reward-maximizing expert. From Bayes' rule we have, for
one step,

$$ P(f^*_{t+1} = f_i \mid x_t, a_t, r_t) \propto P(r = r_t \mid f^* = f_i, x_t, a_t)\, P(f^* = f_i) \tag{69} $$

$$ \propto w_{i,t}\, p_i(r_t \mid x_t, a_t) \tag{70} $$

We can see that the update rule and Bayes' rule take the same form. Finally, the posterior
distribution on $f_t^*$ is

$$ P(f_t^* = f_i) = \frac{w_{i,t}}{\sum_j w_{j,t}} \tag{71} $$
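This equivalence is easy to confirm numerically. The sketch below, with hypothetical function names and illustrative numbers, compares one exponential-weights step under log loss with one Bayesian posterior update in which each expert's prediction is treated as the likelihood of the observed reward.

```python
import math

def exp_weight_update(weights, preds, reward, eta=1.0):
    """One exponential-weights step with logarithmic loss, then normalisation."""
    losses = [math.log(1 / p) if reward == 1 else math.log(1 / (1 - p)) for p in preds]
    new = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    z = sum(new)
    return [w / z for w in new]

def bayes_update(prior, preds, reward):
    """Posterior over experts when preds[i] is expert i's probability that r = 1."""
    likelihood = [p if reward == 1 else 1 - p for p in preds]
    post = [w * l for w, l in zip(prior, likelihood)]
    z = sum(post)
    return [w / z for w in post]

weights = [0.5, 0.3, 0.2]
preds = [0.9, 0.5, 0.1]
after_exp = exp_weight_update(weights, preds, reward=1)
after_bayes = bayes_update(weights, preds, reward=1)
```

The two results agree up to floating-point error when `eta` is 1; with any other learning rate the exponential-weights step corresponds to a tempered likelihood rather than the exact Bayesian update.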


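We can also see it in a second way. Let $y_t$, $x_t$ and $r_t$ be the selected arm, context and reward
at time $t$, and let $y^t$, $x^t$, $r^t$ be the sequences of selected arms, contexts and rewards at times
$1, \ldots, t$ respectively. Then from Bayes' rule we have

$$ p(y_t \mid y^{t-1}, r^{t-1}, x^t) = \frac{p(y^{t-1}, y_t, r^{t-1} \mid x^t)}{p(y^{t-1}, r^{t-1} \mid x^t)} \tag{72} $$

$$ = \frac{p(y^t \mid r^{t-1}, x^t)}{p(y^{t-1} \mid r^{t-1}, x^t)} \tag{73} $$

Assume we have a uniform mixture of the distributions defined by the experts (note that we are
assuming a uniform mixture over $y^t$ and not $y_t$); then we have

$$ \frac{p(y^t \mid r^{t-1}, x^t)}{p(y^{t-1} \mid r^{t-1}, x^t)} = \frac{\sum_f f(y^t \mid r^{t-1}, x^t)}{\sum_f f(y^{t-1} \mid r^{t-1}, x^t)} \tag{74} $$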

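From the update rule we have:

$$ p(y_t \mid y^{t-1}, r^{t-1}, x^t) = \frac{\sum_f w_{f,t-1}\, f(y_t \mid x_t)}{\sum_f w_{f,t-1}} \tag{75} $$

$$ = \frac{\sum_f w_0\, p_f(r_1 \mid y^1, x^1)\, p_f(r_2 \mid y^2, x^2, r^1) \cdots p_f(r_{t-1} \mid y^{t-1}, x^{t-1}, r^{t-2})\, f(y_t \mid x_t)}{\sum_f w_0\, p_f(r_1 \mid y^1, x^1)\, p_f(r_2 \mid y^2, x^2, r^1) \cdots p_f(r_{t-1} \mid y^{t-1}, x^{t-1}, r^{t-2})} \tag{76} $$

$$ = \frac{\sum_f f(y^{t-1} \mid x^{t-1}, r^{t-1})\, f(y_t \mid x_t)}{\sum_f f(y^{t-1} \mid x^{t-1}, r^{t-1})} \tag{77} $$

$$ = \frac{\sum_f f(y^t \mid x^t, r^{t-1})}{\sum_f f(y^{t-1} \mid x^{t-1}, r^{t-1})} \tag{78} $$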


So we can see that the update rule and Bayes' rule have the same form. Note, however, that in this
view we condition on $x^t$, while in the first view the posterior distribution $w_t$ is conditioned
on $x^{t-1}$.
4.4 Regret

The basic idea of the derivation is that we assume a connection between the loss function and
the regret. Define the immediate regret $\Delta_i(x) = \mu(x, \mathcal{E}^*(x)) - \mu(x, \mathcal{E}_i(x))$, the
shifted loss of expert $i$, $\hat{\ell}_i(r \mid x, a) = \ell(f_i(x, a), r) - \ell(f^*(x, a), r)$, and the
average shifted loss $\bar{\ell}_t = \mathbb{E}_{r_t, a_t}\left[ \sum_i w_{i,t}\, \hat{\ell}_i(r_t \mid x_t, a_t) \right]$.
We assume there is a constant $k_1$ such that $\Delta_i(x_t) \le k_1 \sqrt{\bar{\ell}_t}$. We also make use of
the self-boundedness property of the loss function,
$\mathbb{E}_r\left[ \hat{\ell}_i(r \mid x, a)^2 \right] \le k_2\, \mathbb{E}_r\left[ \hat{\ell}_i(r \mid x, a) \right]$, which means
the second moment is bounded by the first moment of the shifted loss. Then we can bound the expected
regret by


V^ 4 A: 2 (e - 2)ki{l - 'y)\ T ■ In — -h 7 T 
V Pi 


(80) 


Different loss has different choice of ki and k 2 , and [1] proved that with square loss the ex¬ 
pected regret bound is with logarithmic loss the expected regret bound is 

0(yh^Ti^2/32-2/3)_ 